Abstract

The precision timed architecture presents a real-time embedded processor with instruction-set extensions that
provide precise timing control via timing instructions to the programmer. Programmers not only describe their
functionality using C, but they can also prescribe timing requirements in the program. We target this architecture and
present a static scratchpad memory allocation scheme that greedily attempts to meet these timing requirements. Our
objective is to schedule the minimum number of instructions and minimize data allocation to the scratchpads such that
the timing requirements in the program are met. Once the timing requirements are satisfied, the remainder of the
scratchpad memory can be used to optimize some other metric desired by the programmer. As an example, we minimize
the frequency of main memory accesses in the program. This work presents the following: 1) high-level timing
constructs for C that synthesize to timing instructions and 2) a greedy iterative instruction and data scratchpad memory
allocation scheme that attempts to first meet the specified timing requirements.


1  Introduction 

In real-time embedded systems, scratchpad memories are often used as an alternative to on-chip cache memories.
They offer reduced area and lower power and energy consumption. More importantly, they provide a
high degree of timing predictability [2, 10, 18, 20]. Scratchpads do not require hardware policies to determine the
transfers on and off the scratchpads, which reduces hardware logic, power and area, and improves timing predictability.
The cost of using scratchpads, however, is that the programmer is responsible for scheduling these transfers. Consequently,
automatic compiler-level memory allocation schemes are desirable.

There are numerous scratchpad memory allocation schemes targeting non-real-time embedded systems [1, 12, 19]
that focus on improving the average-case execution time (ACET) of a program. But simply improving the ACET
does not help in meeting the timing requirements of a real-time embedded application. Alternatively, worst-case
execution time (WCET) based memory allocation schemes for real-time embedded systems are proposed in [17, 14,
4]. These approaches minimize a task’s worst-case execution path by scheduling instructions and data words to the
scratchpads. A real-time operating system (RTOS) and its scheduling algorithms are used to specify the programmer’s
timing requirements. These allocation schemes do not necessarily make an effort to meet the programmer’s intended
timing requirements because that information is unavailable in the program. Unfortunately, it is difficult for a compiler
to do this because today’s programming languages do not specify timing requirements and most embedded processors
lack mechanisms to enforce them.

An exception is the precision timed (PRET) architecture proposed by Lickly et al. [9]. Edwards and Lee [5] recently
made the case to reintroduce time as a first-class property at all abstraction levels, and the PRET architecture [9]
presents  their  initial  efforts.  This  is  a  real-time  embedded  processor  that  provides  timing  as  predictable  as  its  logical 
function.  It  is  a  multithreaded  processor  based  on  the  SPARC  instruction-set  architecture  with  scratchpad  memories, 
extensions  for  precise  timing  control  and  no  RTOS.  It  accepts  C  programs  compiled  with  the  GCC  toolchain. 

While  using  the  PRET  architecture,  we  noticed  two  issues  that  make  its  practical  use  difficult.  They  are:  1)  timing 
requirements  are  encoded  via  low-level  assembly  commands  and  2)  there  is  an  assumption  that  instructions  and  data 
fit  entirely  on  the  scratchpads.  In  our  experience,  it  is  inconvenient  to  use  assembly-level  instructions  intermingled 
with  C  and  any  realistic  application  is  typically  greater  than  the  scratchpad  size.  Our  efforts  in  this  work  are  to  address 
these  two  issues. 


Figure 1: Simple Example using Timing Constructs

We  address  the  first  issue  by  introducing  timing  constructs  in  C  with  associated  semantics.  These  synthesize  to 
PRET’s  timing  instructions  [7,  5,  9]  and  they  are  used  as  a  means  of  specifying  timing  requirements  in  the  program. 
The synthesis of timing instructions is a source-to-source transformation for which we use an open-source C front-end
parser [3]. For the second issue, we propose a static memory allocation scheme that greedily assigns instructions and
data words on the scratchpads to satisfy the timing requirements specified by the programmer.

Typically, scratchpad memory allocation schemes focused on improving ACET use profile-based methods, and
schemes concerned with real-time behavior (WCET-based) use static program analysis. The latter is particularly important for
safety-critical and hard real-time embedded systems. We, however, are interested in soft real-time embedded applications
where timing violations as a result of not exercising the absolute worst-case path are undesired, but not hazardous.
Furthermore, constantly evolving architectures and traditional input-data dependent programming styles hinder the
wide use of static WCET analysis [15]. Since Lickly et al. [9] clearly indicate that the PRET architecture is still in its
infancy and evolving, we opt for a profile-based method. In the future, we plan to generate probability distributions
from the profiled data and then perform the allocation with knowledge of the expected cases as well.

In  this  work,  we  leverage  PRET’s  timing  instructions  [5,  9,  8]  and  introduce  timing  constructs  for  C  as  a  means 
of  specifying  timing  requirements  in  the  program.  We  then  shift  our  focus  to  a  static  memory  allocation  scheme  that 
greedily  assigns  instructions  and  data  words  to  scratchpads  to  satisfy  the  timing  requirements  and  then  stops,  leaving 
resources  available  for  non-real-time  functions.  Our  objective  is  to  schedule  the  minimum  number  of  instructions  and 
minimize  data  allocation  to  the  scratchpads  such  that  the  timing  requirements  specified  in  the  program  are  met.  This 
allows  the  remaining  scratchpad  space  to  be  used  as  the  programmer  sees  fit,  such  as  to  minimize  power  consumption 
or  the  execution  time.  We  present  one  possible  approach  that  uses  the  remaining  scratchpad  space  to  minimize  the 
frequency of main memory accesses of the entire program. We select a static memory allocation scheme
because dynamic schemes add transfer instructions to the original program, which may change its execution
times. To our knowledge, this is the first scratchpad memory allocation scheme that tries to meet timing requirements
specified in programs.

2  Related  Work 

Scratchpad memory allocation schemes for non-real-time embedded applications are presented in [1,
12, 19, 13, 16, 21]. These approaches focus on improving ACET and investigate different criteria such as allocation
schemes without fixed size memories, multiple heterogeneous memories in the memory hierarchy, an intersecting
lifetime criterion for selecting arrays to allocate, reducing energy costs and so on. However, they are not concerned
with real-time requirements.

There are several works that focus on real-time embedded systems. Suhendra et al. [17] present a static WCET-based
data allocation scheme for scratchpads using static program analysis. The approach is to iteratively minimize
the worst-case execution path in a task. Puaut [14] presents a dynamic WCET-based scratchpad instruction allocation
scheme, and later Deverge and Puaut [4] extend this allocation scheme to data as well. None of these memory
allocation schemes take into account timing requirements that may be specified by the user in the program, nor do they
schedule instructions and data to meet them.


3  Timing  Instructions  &  Constructs 


We  begin  by  summarizing  the  timing  instruction  extensions  for  PRET  called  deadline  instructions.  Using  these  timing 
instructions,  we  define  deadline  blocks  that  are  synthesized  from  our  timing  constructs  used  in  C. 

Figure 2: Memory Allocation Usage Flow


3.1  Deadline  Instruction 

The deadline instruction introduced by Ip and Edwards [7] is defined as a pair $(dr_i, v)$, where $dr_i$ is a deadline register
($0 \le i \le 7$) and $v$ is the value for the deadline. The semantics of the deadline instruction are as follows. If $dr_i$
is zero, the deadline instruction loads the deadline value $v$ into the register $dr_i$, which is then decremented every
clock cycle. Note that the value in the deadline instruction represents a cycle timer. If, however, $dr_i$ is non-zero, the
deadline instruction blocks until $dr_i$ becomes zero, after which the new value $v$ is loaded into the register. Additional
implementation details of the PRET architecture are explained in [9].
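
The semantics above can be illustrated with a small software model. This is only a sketch under simplifying assumptions: the class name `DeadlineRegister` and its methods are hypothetical, the cost of issuing the deadline instruction itself is ignored, and nothing here reflects how the PRET hardware is implemented.

```python
class DeadlineRegister:
    """Toy model of one deadline register, treated as a cycle countdown timer."""

    def __init__(self):
        self.value = 0  # remaining cycles; zero means the register is free

    def tick(self, cycles=1):
        """Advance the clock: the register counts down but never below zero."""
        self.value = max(0, self.value - cycles)

    def deadline(self, v):
        """Model the deadline instruction (dr_i, v).

        Returns the number of cycles the instruction blocked before loading v.
        """
        stalled = self.value  # block until the register has counted down to zero
        self.value = v        # then load the new deadline value
        return stalled


dr = DeadlineRegister()
first_stall = dr.deadline(10)  # register was zero: loads 10 without blocking
dr.tick(4)                     # 4 cycles of other work, 6 cycles remain
second_stall = dr.deadline(0)  # blocks for the remaining 6 cycles, then loads 0
print(first_stall, second_stall)
```

In this model the second deadline instruction is what pads the elapsed time up to the value loaded by the first one.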

3.2  Deadline  Block 

We define a deadline block $d$ as a pair of deadline instructions $d = ((dr_i, v_0), (dr_i, 0))$. We call the first deadline
instruction $(dr_i, v_0)$ the start and the second $(dr_i, 0)$ the end. Notice that the end deadline instruction has a deadline
value of 0. The program code that can be executed between a start and end deadline instruction is said to be
contained within the deadline block. The value $v_0$ at the start of the deadline block specifies the timing requirement
in cycles for the program code encapsulated. A timing violation of this requirement occurs when the code within
the deadline block takes longer to execute than the specified value. Within a deadline block, we allow most
constructs such as function calls, if-then, for and while. However, we disallow control jumps that jump outside
the deadline block such as break, continue, return and goto. Note that a deadline block never finishes earlier
than its specified value $v_0$. This is a key part in achieving repeatable execution times from the architecture. The
program’s execution stalls if the execution reaches the end of a deadline block before the countdown from $v_0$ reaches 0,
and it resumes once the countdown is 0.
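
As a worked illustration of these rules, the following sketch computes what happens to a single deadline block given its requirement and the number of cycles its code actually needs. The function name `run_deadline_block` and the cycle counts are assumptions made for the example.

```python
def run_deadline_block(v0, code_cycles):
    """Return (elapsed_cycles, stall_cycles, violated) for one deadline block.

    v0          -- timing requirement in cycles, loaded by the start instruction
    code_cycles -- cycles the enclosed code needs (hypothetical measurement)
    """
    violated = code_cycles > v0           # code ran longer than the requirement
    stall = max(0, v0 - code_cycles)      # end instruction blocks until the timer is 0
    elapsed = code_cycles + stall         # never less than v0
    return elapsed, stall, violated


print(run_deadline_block(100, 60))   # finishes early, so it stalls for 40 cycles
print(run_deadline_block(100, 130))  # overruns by 30 cycles: a timing violation
```

The first call shows why execution times are repeatable: any run that needs at most 100 cycles takes exactly 100.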


3.3  Timing  Constructs 

We introduce two broad categories of timing constructs: sequential and loop. The sequential constructs (DEADSEQ())
enforce a lower bound on the execution time of a sequential segment of program code. On the other hand, the
loop constructs (DEADFOR(), DEADWHILE(), ...) enforce a periodic execution of the code within the scope.

Figure 1 shows an example of the use of timing constructs and the result after the transformation. Note that the
arguments after the deadline register values are unique labels used for profiling. Also note that we have specifically
defined the synthesis of these instructions to match the appropriate semantics. Currently, we encode the deadline
values in cycles, but we are also investigating the specification of timing requirements in time
and frequency units that are automatically synthesized to cycles using a front-end parser.

4 Profiler and Memory Allocation Usage Flow


Figure  2  presents  a  block-level  diagram  of  the  phases  involved  in  profiling  information  about  a  program  P.  Program 
P  is  a  C  program  with  timing  constructs.  Note  that  these  constructs  are  currently  only  supported  by  the  PRET 
architecture. 

Phase 1 receives a program P as an input and produces multiple subprograms $p_i$. The purpose of these subprograms
is to exercise different paths in the program flow of the original program P. The subprograms can be either
generated manually or via third-party test generation tools. Currently, we manually create the subprograms.

Phase 2 takes these subprograms and synthesizes timing instructions from the timing constructs, resulting in
program tests $t_i$. In addition, we automatically annotate the source with profiling labels. The profiler identifies the
start and end of deadline blocks via these labels. A profiler command file c is generated during this phase. We
use the open-source C/C++ front-end Clang [3] to implement the necessary rules to conduct the source-to-source
transformations. We also automatically allocate deadline registers to the timing constructs. Figure 1 shows a simple
example. By performing source-to-source transformations, we do not have to change GCC’s implementation of the C
standard to support these new timing constructs.

The actual profiling is done in Phase 3. The program tests and the command file are input to the profiler CSIM.
The output of the profiler is a data set $DS_{t_i}$ for every test $t_i$. These data sets contain the following information:

•  For  every  deadline  block:  Execution  time,  instructions  encountered  and  their  frequency,  data  words  read  and 
written  to  and  their  frequency. 

•  Other  blocks:  instructions  encountered  and  their  frequency,  data  words  read  and  written  to  and  their  frequency, 
total  execution  time  of  test. 

The  final  Phase  4  is  where  the  memory  allocation  is  done.  The  first  part  of  the  memory  allocation  is  a  feasibility 
check  that  ensures  that  a  memory  allocation  that  results  in  meeting  the  timing  requirements  does  exist  for  some 
scratchpad  size.  If  one  does  exist,  then  a  greedy  algorithm  is  used  to  allocate  instructions  and  data  words  to  the 
scratchpads.  Note  that  there  is  a  back  arrow  to  Phase  3  because  after  an  allocation  is  performed,  the  profiling  is 
rerun.  This  process  continues  until  Phase  4  yields  an  allocation  that  meets  the  specified  timing  requirements  or  fails 
due  to  scratchpad  size. 

5 Static Memory Allocation


We  pictorially  represent  our  allocation  algorithm  in  Figure  3  and  describe  the  stages  briefly.  Note  that  Table  1  is  a  list 
of  variables  we  employ  throughout  the  remainder  of  the  paper. 

Symbol           Definition
$T$              set of tests.
$t$              one test in $T$.
$INST(t)$        set of instructions encountered in test $t$.
$DATA(t)$        set of data words written to/read from in test $t$.
                 set of deadline blocks in test $t$.
$d$              a deadline block.
$X(d)$           set of instructions executed in deadline block $d$.
$X$              set of instructions executed in any deadline block.
$Y(d)$           set of non-global data words read from or written to in deadline block $d$.
$Y$              set of data read or written in any deadline block.
$SIZE(m)$        size of memory unit $m$.
                 cycles missed for deadline block $d$ when in main memory.
$N(x)$           frequency of instruction $x$ encountered.
$N(x, d)$        frequency of instruction $x$ encountered in deadline block $d$.
$N_r(y)$         frequency of reading data word $y$.
$N_r(y, d)$      frequency of reading data word $y$ in deadline block $d$.
$N_w(y)$         frequency of writing data word $y$.
$N_w(y, d)$      frequency of writing data word $y$ in deadline block $d$.
$A_{mm}$         accessing time to main memory via the memory wheel.
$ALLOC$          number of words to allocate to scratchpad at each iteration.
$PRE_{inst}$     set of previously scheduled instructions.
$PRE_{data}$     set of previously scheduled data words.

Table 1: Variable Definitions


5.1  Feasibility  Check 

Before performing any allocation, we first perform a feasibility check. We say that a program P is infeasible if it
cannot meet all timing requirements in every program test. In order to check for infeasibility, we assume that all
instructions and data are in a scratchpad, and run each of the tests. The feasibility check is successful if none of the
timing requirements are violated. Otherwise, the program is infeasible and the user must alter the timing requirements
in the original program.
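
A minimal sketch of this check is shown below. It assumes the profiler has already been run with everything placed in the scratchpads and that its results are available as plain dictionaries; the function names, the dictionary layout and the cycle counts are all hypothetical.

```python
def infeasible_blocks(tests):
    """Return the labels of deadline blocks that miss their deadline even
    when all instructions and data reside in the scratchpads.

    tests -- {test_name: {block_label: (deadline_cycles, all_spm_cycles)}}
    """
    return sorted({
        label
        for blocks in tests.values()
        for label, (deadline, cycles) in blocks.items()
        if cycles > deadline
    })


def is_feasible(tests):
    """The program is feasible only if no block is violated in any test."""
    return not infeasible_blocks(tests)


# Hypothetical best-case profile: block d2 overruns in test t2.
best_case = {
    "t1": {"d1": (100, 80), "d2": (200, 190)},
    "t2": {"d1": (100, 95), "d2": (200, 230)},
}
print(is_feasible(best_case), infeasible_blocks(best_case))
```

Reporting the offending blocks, rather than only a yes/no answer, tells the user which timing requirements have to be relaxed.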

5.2  Memory  Allocation 

The memory allocation in Phase 4 is refined in Figure 3. This step is reached only after the feasibility check passes.
We begin by identifying deadline blocks that violate their timing requirements. If there are violations but the existing
allocation has occupied the maximum size of the scratchpad memories, then for the given scratchpad memory size it
is not possible to meet the timing requirements. On the other hand, if there is space available on the scratchpads, we
perform instruction and data allocation and merge the existing allocation with the generated allocation. Note that once
an allocation is generated, those instructions and data words are not reconsidered for allocation during the following
iterations. If there are no deadline blocks that missed their deadlines, then we have successfully found an allocation
that meets all timing requirements. In this case, we can use any remaining space on the scratchpad as we see fit,
including for scheduling instructions and data encountered outside of the deadline blocks. In our implementation, we
allocate on the basis of the frequency of main memory accesses in the whole program, but any metric could be used.

Figure 3: Iterative Static Allocation Flow
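
The control flow of this iterative scheme can be sketched as follows. The sketch makes several simplifying assumptions that are not part of the original scheme: a single scratchpad pool instead of separate instruction and data scratchpads, one word per item, and stand-in `profile` and `select` callbacks in place of the real profiler and solver. All names and numbers are hypothetical.

```python
def iterative_allocation(profile, select, spm_size, alloc_step):
    """Grow the allocation until no deadline block is violated or space runs out.

    profile(allocated)              -> set of violating deadline block labels
    select(violating, allocated, k) -> set of at most k new items to allocate
    Returns (success, allocated_items).
    """
    allocated = set()
    while True:
        violating = profile(allocated)        # re-profile with current allocation
        if not violating:
            return True, allocated            # all timing requirements met
        free = spm_size - len(allocated)
        if free <= 0:
            return False, allocated           # scratchpad exhausted
        new_items = select(violating, allocated, min(alloc_step, free))
        if not new_items:
            return False, allocated           # nothing left that could help
        allocated |= new_items                # merge with the existing allocation


# Toy program: one block "d1" with a 100-cycle deadline. COST[i] is the
# main-memory penalty that item i adds while it is not in the scratchpad.
COST = {"x1": 40, "x2": 30, "y1": 20, "x3": 5}
BASE, DEADLINE = 50, 100

def toy_profile(allocated):
    cycles = BASE + sum(c for item, c in COST.items() if item not in allocated)
    return {"d1"} if cycles > DEADLINE else set()

def toy_select(violating, allocated, k):
    candidates = sorted((i for i in COST if i not in allocated),
                        key=lambda i: (-COST[i], i))
    return set(candidates[:k])

ok, chosen = iterative_allocation(toy_profile, toy_select, spm_size=4, alloc_step=1)
print(ok, sorted(chosen))
```

With these toy numbers the loop stops after two items, leaving the other two scratchpad words free for a secondary objective.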


5.3  Previously  Scheduled  Instructions  and  Data 

The sets $PRE_{inst}$ and $PRE_{data}$ hold all the instructions and data allocated during the previous iterations. We allocate
these instructions and data words at the beginning of each iteration and remove them from the set of instructions and
data considered for memory allocation.

5.4  Identify  Violating  Deadline  Blocks 

We construct a set of deadline blocks $V$ that violate their timing requirements. However, it is possible for a deadline
block to violate its timing requirements in more than one test. It may also be the case that the program paths taken in
the tests are different from each other; thus, different instructions and data words are encountered during profiling. As
a result, we use a simple heuristic to collate the instructions, data words and execution times of these deadline blocks.
For all tests where the deadline block missed its deadline, we select the maximum execution time of the deadline block
and the union of both instructions and data. These updated deadline blocks are added to set $V$ and their corresponding
information is merged as described above. Therefore, any indexing based on deadline blocks in $V$ yields the merged
information.
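
The merging heuristic can be sketched as below, assuming each test's profile is a dictionary keyed by deadline block label. The field names and the sample profiles are hypothetical, and the sketch only merges execution times and the instruction and data sets.

```python
def merge_violations(tests):
    """Build V: for every block violated in at least one test, keep the
    maximum execution time and the union of instructions and data words,
    taken over the tests in which the block missed its deadline.

    tests -- list of {label: {"deadline": int, "time": int,
                              "inst": set, "data": set}}
    """
    merged = {}
    for blocks in tests:
        for label, b in blocks.items():
            if b["time"] <= b["deadline"]:
                continue  # this test met the deadline, so it does not contribute
            m = merged.setdefault(label, {"time": 0, "inst": set(), "data": set()})
            m["time"] = max(m["time"], b["time"])
            m["inst"] |= b["inst"]
            m["data"] |= b["data"]
    return merged


t1 = {"d1": {"deadline": 100, "time": 120, "inst": {"x1", "x2"}, "data": {"y1"}},
      "d2": {"deadline": 200, "time": 150, "inst": {"x7"}, "data": set()}}
t2 = {"d1": {"deadline": 100, "time": 150, "inst": {"x2", "x3"}, "data": {"y2"}},
      "d2": {"deadline": 200, "time": 180, "inst": {"x8"}, "data": set()}}
V = merge_violations([t1, t2])
print(V)
```

Here only d1 ends up in V, with the execution time of the slower test and the instructions and data of both paths.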


5.5  Exhausting  Scratchpad  Space 

At every iteration we check whether the remaining space in the scratchpad allows for scheduling $ALLOC$ additional words
of instructions or data. In the event that the remaining space is less than $ALLOC$, we reduce $ALLOC$ to the size of
the remaining space. If, however, there is no space left at all on the scratchpad and timing requirements are violated,
the allocation fails. Otherwise, if there are no timing violations, the allocation terminates successfully as shown in
Figure 3.


5.6  Instruction 

An access to the main memory goes through the memory wheel [9] that can take at worst $A_{mm}$ cycles to complete.
For a deadline block $d$, the memory access cost for an instruction $x \in X(d) \setminus PRE_{inst}$ is no more than the frequency
of that instruction $N(x, d)$ multiplied by the main memory access time $A_{mm}$.

We  formulate  a  binary  integer  linear  programming  problem  for  scheduling  both  instructions  and  data  encountered 
in  deadline  blocks.  We  will  start  by  defining  the  variables  that  correspond  to  instructions  that  could  be  added  to  the 
scratchpads. 

$$I_{inst}(x) = \begin{cases} 1 & \text{allocate instruction } x \text{ to scratchpad,} \\ 0 & \text{otherwise.} \end{cases}$$

The instruction savings $IS(V)$ is given below. Instructions that contribute most to the instruction access time from
main memory are more heavily weighted than those that do not.

$$IS(V) = \sum_{d \in V} \sum_{x \in X(d) \setminus PRE_{inst}} I_{inst}(x)\, N(x, d)\, A_{mm}$$


5.7  Data 


Once again, we are interested in scheduling the data accesses that contribute most to the cost in memory access time. Due
to the simplicity of the PRET architecture, accesses for data to main memory also go through the memory wheel with
worst-case access time of $A_{mm}$. For a deadline block $d$, a data word $y \in Y(d) \setminus PRE_{data}$ is written to $N_w(y, d)$
and read from $N_r(y, d)$ number of times. Therefore, the memory access cost for $y$ is the memory access time $A_{mm}$
multiplied by the sum of the number of times $y$ was read from and written to in deadline block $d$.

$$I_{data}(y) = \begin{cases} 1 & \text{allocate data word } y \text{ to scratchpad,} \\ 0 & \text{otherwise.} \end{cases}$$

The  savings  from  allocating  data  words  DS(V)  is  highest  when  data  words  that  contribute  most  to  the  main 
memory  access  cost  are  allocated. 

$$DS(V) = \sum_{d \in V} \sum_{y \in Y(d) \setminus PRE_{data}} I_{data}(y)\, A_{mm} \left[ N_r(y, d) + N_w(y, d) \right]$$
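
To make the two savings terms concrete, the sketch below evaluates the instruction savings IS(V) and the data savings DS(V) for a given choice of words. The binary variables are represented by membership in the hypothetical sets `chosen_inst` and `chosen_data`, and the access time and frequencies are made-up values rather than measurements.

```python
def instruction_savings(x_of, n_inst, chosen_inst, pre_inst, a_mm):
    """IS(V): x_of maps each block d in V to X(d); n_inst maps (x, d) to N(x, d)."""
    return sum(
        n_inst[(x, d)] * a_mm
        for d, xs in x_of.items()
        for x in xs - pre_inst       # skip previously scheduled instructions
        if x in chosen_inst          # I_inst(x) = 1
    )


def data_savings(y_of, n_read, n_write, chosen_data, pre_data, a_mm):
    """DS(V): y_of maps each block d in V to Y(d); reads and writes both count."""
    return sum(
        a_mm * (n_read.get((y, d), 0) + n_write.get((y, d), 0))
        for d, ys in y_of.items()
        for y in ys - pre_data       # skip previously scheduled data words
        if y in chosen_data          # I_data(y) = 1
    )


A_MM = 10  # assumed worst-case main memory access time in cycles
is_v = instruction_savings({"d1": {"x1", "x2"}},
                           {("x1", "d1"): 5, ("x2", "d1"): 2},
                           chosen_inst={"x1"}, pre_inst=set(), a_mm=A_MM)
ds_v = data_savings({"d1": {"y1"}},
                    {("y1", "d1"): 3}, {("y1", "d1"): 1},
                    chosen_data={"y1"}, pre_data=set(), a_mm=A_MM)
print(is_v, ds_v, is_v + ds_v)
```

Allocating x1 saves 5 accesses of 10 cycles each, and allocating y1 saves 4 accesses, so the two terms are 50 and 40 cycles.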

5.8  Allocation 

Our binary integer linear program shown below maximizes the total savings $S(V)$ obtained from allocating instructions and data.
Thus, the objective function that we maximize is the sum of the savings from each of the instruction and data
allocations:

Maximize:

$$S(V) = IS(V) + DS(V)$$

The  constraint  equations  must  make  sure  that  we  do  not  allocate  more  words  than  we  are  allowed  in  any  given 
iteration.  Remember  that  we  keep  history  of  the  previously  allocated  instructions  and  merge  the  new  ones  with  the 
previous  ones. 

Subject to:

$$\sum_{x \in X \setminus PRE_{inst}} I_{inst}(x) + \sum_{y \in Y \setminus PRE_{data}} I_{data}(y) \le ALLOC$$

Solving  this  binary  integer  linear  program  gives  us  the  allocation  of  data  and  instructions  to  the  scratchpads  in  a 
single  iteration  of  our  allocation  scheme.  These  iterations  continue  as  shown  in  Figure  3  modifying  PREinst  and 
PREdata  to  reflect  the  contents  added  to  the  scratchpad  in  each  iteration. 
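
For intuition about what one iteration selects, note that every binary variable has unit weight in the single constraint and every objective coefficient is non-negative. Under those conditions an optimal solution is simply the $ALLOC$ words with the largest accumulated savings. The sketch below uses that shortcut in place of an ILP solver; it is not the solver-based implementation used in our infrastructure, and the names and numbers are hypothetical.

```python
from collections import defaultdict


def select_top_savings(per_block_savings, alloc):
    """Pick at most `alloc` words maximizing the total savings.

    per_block_savings -- {(word, block): saving in cycles}; a word that
                         appears in several violating blocks accumulates
                         the savings of all of them.
    Returns (chosen_words, total_saving).
    """
    total = defaultdict(int)
    for (word, _block), saving in per_block_savings.items():
        total[word] += saving
    # Largest saving first; ties are broken by name for reproducibility.
    ranked = sorted(total, key=lambda w: (-total[w], w))
    chosen = ranked[:alloc]
    return chosen, sum(total[w] for w in chosen)


savings = {
    ("x1", "d1"): 50, ("x1", "d2"): 20,   # x1 appears in two violating blocks
    ("x2", "d1"): 60,
    ("y1", "d1"): 40,
}
chosen, total_saving = select_top_savings(savings, alloc=2)
print(chosen, total_saving)
```

A general solver becomes necessary once words have different sizes or further constraints are added, since the problem then becomes a knapsack-style selection.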

5.9 Minimize Frequency of Main Memory Accesses

After  all  deadline  blocks  meet  the  timing  requirements,  one  final  iteration  takes  place  to  schedule  the  remaining 
scratchpad  space.  We  decide  to  minimize  the  frequency  of  main  memory  accesses  in  the  entire  program,  but  other 
optimizations  for  low  power  and  energy  are  also  possible.  We  define  INST  and  DATA  as  the  remaining  set  of 
instructions  and  data  words,  respectively. 


$$INST = \bigcup_{t \in T} INST(t) \setminus PRE_{inst}$$

$$DATA = \bigcup_{t \in T} DATA(t) \setminus PRE_{data}$$


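We redefine the binary variables for the instruction and data.

$$I'_{inst}(x) = \begin{cases} 1 & \text{allocate instruction } x \text{ to scratchpad in final phase,} \\ 0 & \text{otherwise.} \end{cases}$$

$$I'_{data}(y) = \begin{cases} 1 & \text{allocate data word } y \text{ to scratchpad in final phase,} \\ 0 & \text{otherwise.} \end{cases}$$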

We select the instructions and data words that cost the most in terms of frequency of main memory accesses by
solving another binary integer linear program. In this case, our objective function is $S'(INST, DATA)$.

Maximize:

$$S'(INST, DATA) = \sum_{x \in INST} I'_{inst}(x)\, N(x) + \sum_{y \in DATA} I'_{data}(y) \left[ N_r(y) + N_w(y) \right]$$


During  this  iteration,  we  want  to  allocate  all  of  the  remaining  space  on  the  scratchpad,  which  we  describe  in  the 
constraint  below. 
Subject to:

$$\sum_{x \in INST} I'_{inst}(x) \le SIZE(spm) - SIZE(PRE_{inst})$$

$$\sum_{y \in DATA} I'_{data}(y) \le SIZE(spm) - SIZE(PRE_{data})$$
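
Because the instruction and data constraints are independent, the final phase can be pictured as filling each scratchpad separately with its most frequently accessed remaining words. The sketch below assumes one word per item, measures SIZE of a set as its number of words, and again replaces the solver with a sort; the function name and all frequencies are hypothetical.

```python
def fill_remaining(freq, spm_size, pre_allocated):
    """Fill the free space of one scratchpad with the most frequently
    accessed words that are not already allocated.

    freq          -- {word: access frequency over the whole program}
    spm_size      -- capacity of this scratchpad in words
    pre_allocated -- words placed there to meet the timing requirements
    """
    free = max(0, spm_size - len(pre_allocated))
    candidates = [w for w in freq if w not in pre_allocated]
    candidates.sort(key=lambda w: (-freq[w], w))  # most frequent first
    return candidates[:free]


inst_freq = {"x1": 90, "x2": 70, "x3": 15, "x4": 40}  # N(x)
data_freq = {"y1": 12, "y2": 30, "y3": 8}             # N_r(y) + N_w(y)

extra_inst = fill_remaining(inst_freq, spm_size=3, pre_allocated={"x1", "x2"})
extra_data = fill_remaining(data_freq, spm_size=2, pre_allocated=set())
print(extra_inst, extra_data)
```

Swapping the frequency tables for per-word energy estimates would turn the same selection into an energy-oriented one.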


6  Experiments 

Our infrastructure uses Clang [3] to perform the source-to-source transformations from timing constructs in C to
timing instructions compilable by SPARC’s GCC toolchain. We use Python for the profiling and integration with
PRET’s cycle-accurate simulator, and lp_solve as the solver for selecting the instructions and data words that offer
the most savings in terms of main memory access time. We select a few C benchmarks from the Mälardalen [6] suite
and augment them with timing requirements.


Figure  4:  Execution  Times  with  Varying  Scratchpad  Size 


In  Figure  4,  we  vary  the  scratchpad  size  (both  instruction  and  data)  as  a  percentage  of  the  program  instructions 
and  data,  and  present  the  execution  times  of  the  benchmarks.  At  0%  only  the  main  memory  is  used.  We  also  denote 
whether  allocation  failed  at  each  of  the  scratchpad  sizes;  a  circle  denotes  failure  and  a  triangle  denotes  success.  As 
shown,  the  selected  benchmarks  eventually  met  their  timing  requirements.  In  addition  to  the  benchmarks  shown  in 
Figure 4, we plan to port segments of real-time programs from PapaBench [11].

7 Conclusion


The PRET architecture makes the execution time of programs repeatable through its timing instruction extensions. It
also allows the programmers to specify their timing requirements within their programs, unlike traditional embedded
programming languages like C. In this work, we leverage these timing instructions and provide high-level timing
constructs to specify timing requirements in the program. Afterward, we present a static scratchpad memory allocation
scheme that focuses on meeting these timing requirements in the program. This effort removes the assumption made
by Lickly et al. [9] of the program and data having to fit entirely on the scratchpads. In our approach, a programmer
specifies these timing requirements in the program using timing constructs. We automatically synthesize timing
instructions from these timing constructs and additional profiling annotations. We iteratively schedule instructions and
data words to the scratchpads for blocks that specify the programmer’s timing requirements such that the least amount
of scratchpad space is used to meet the timing requirements. Once they are satisfied, any remaining space can
be used to optimize based on another metric, such as reducing power or energy consumption.